common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters #15191
Co-authored-by: CISC <[email protected]>
I didn't test exhaustively, but it works for my use case. Small tests show a ~25% improvement on a few prompts with a Q4 draft model.
I don't mind AI-generated code, but you need to take responsibility for the review yourself.
Sure, it was a test to see how well Copilot can resolve simple issues without too much interaction. It went fairly well, I think; I just need to be more hands-on next time. :)
…slaren's feedback) Co-authored-by: ggerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Wouldn't it be coherent to have the short form |
…e-draft parameters (ggml-org#15191)

* Checkpoint from VS Code for coding agent session
* Initial plan
* Fix typo in --override-tensor-draft flag implementation
* Add null termination for speculative tensor buffer overrides
* Apply suggestions from code review
* Apply suggestions from code review
* Extract tensor override parsing logic to common function (addresses @slaren's feedback)
* Apply suggestions from code review
* Apply suggestions

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
CLI Flags Now Working
All three speculative draft model offload flags are now fully functional:
--override-tensor-draft: Specify tensor buffer type overrides for the draft model
--cpu-moe-draft: Keep all MoE weights in CPU for the draft model
--n-cpu-moe-draft N: Keep MoE weights of first N layers in CPU for the draft model
Example Usage
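As a rough illustration of what these flags ask for, here is a minimal C++ sketch. It is not the PR's actual code. It assumes override values use a comma-separated `pattern=buffer` form and that the per-layer MoE pattern looks like `blk\.<i>\.ffn_(up|down|gate)_exps`. The names `override_entry`, `parse_overrides` and `cpu_moe_first_n` are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a (tensor-name pattern, buffer type name) pair.
using override_entry = std::pair<std::string, std::string>;

// Parse "pattern=buft,pattern=buft" into entries (assumed syntax).
// Throws std::invalid_argument on an item without a usable '=' separator.
std::vector<override_entry> parse_overrides(const std::string & value) {
    std::vector<override_entry> out;
    std::size_t start = 0;
    while (start <= value.size()) {
        std::size_t end = value.find(',', start);
        if (end == std::string::npos) {
            end = value.size();
        }
        const std::string item = value.substr(start, end - start);
        const std::size_t eq = item.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
            throw std::invalid_argument("bad override: " + item);
        }
        out.emplace_back(item.substr(0, eq), item.substr(eq + 1));
        start = end + 1;
    }
    return out;
}

// Expand "keep MoE weights of the first n layers on the CPU" into one
// override per layer. The regex shape is an assumption for illustration.
std::vector<override_entry> cpu_moe_first_n(int n) {
    std::vector<override_entry> out;
    for (int i = 0; i < n; ++i) {
        out.emplace_back("blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps", "CPU");
    }
    return out;
}
```

Under these assumptions, `--n-cpu-moe-draft 2` would correspond to `cpu_moe_first_n(2)`, and the value passed to `--override-tensor-draft` would go through something like `parse_overrides`, with both results appended to the draft model's override list rather than the main model's.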
Entrypoints Updated
All speculative decoding entrypoints properly apply draft-specific tensor overrides:
examples/speculative/speculative.cpp
examples/speculative-simple/speculative-simple.cpp
tools/server/server.cpp
Validation Results
The implementation ensures draft model tensor overrides are completely independent from main model overrides, enabling flexible heterogeneous hardware setups and advanced MoE configurations for speculative decoding workflows.
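The independence described above, together with the null termination mentioned in the commit list, can be pictured with a small C++ sketch. It assumes each model gets its own override list, that a non-empty list ends with a `{nullptr, nullptr}` sentinel, and that patterns are matched as regular expressions against tensor names. All type and function names here (`buft_override`, `params_sketch`, `seal_list`, `list_ptr`, `resolve`) are hypothetical and only mirror the idea.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical (pattern, buffer type name) pair; nullptr pattern marks the end.
struct buft_override {
    const char * pattern;
    const char * buft;
};

// Main and draft models keep separate lists, so one never affects the other.
struct params_sketch {
    std::vector<buft_override> main_overrides;
    std::vector<buft_override> draft_overrides;
};

// Append the sentinel once, and only to a non-empty list.
void seal_list(std::vector<buft_override> & list) {
    if (!list.empty() && list.back().pattern != nullptr) {
        list.push_back({nullptr, nullptr});
    }
}

// Pointer handed to a consumer: nullptr when there are no overrides.
const buft_override * list_ptr(const std::vector<buft_override> & list) {
    return list.empty() ? nullptr : list.data();
}

// Walk a sentinel-terminated list and return the first matching buffer type,
// or the fallback when nothing matches.
const char * resolve(const buft_override * list, const std::string & tensor, const char * fallback) {
    for (const buft_override * o = list; o != nullptr && o->pattern != nullptr; ++o) {
        if (std::regex_search(tensor, std::regex(o->pattern))) {
            return o->buft;
        }
    }
    return fallback;
}
```

With only a draft override registered, resolving `blk.0.ffn_up_exps.weight` against the draft list picks the override, while resolving the same name against the main list falls through to the fallback. Without the sentinel, a consumer that walks the list by pointer has no way to know where it ends.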